Video Object Segmentation with Language Referring Expressions

Authors

Anna Khoreva

Anna Rohrbach

Bernt Schiele

Abstract

Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations. Leveraging recent advances in language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks, DAVIS16 and DAVIS17, with language descriptions of target objects. We show that our approach performs on par with methods that have access to a pixel-level mask of the target object on DAVIS16 and is competitive with methods using scribbles on the challenging DAVIS17 dataset.

Figure 1: Examples of the proposed approach. Classical semi-supervised video object segmentation relies on an expensive pixel-level mask annotation of a target object in the first frame of a video. We explore a more natural and more practical way of pointing out a target object by providing a language referring expression. Example query: "A man in a red sweatshirt performing breakdance".

Similar resources

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression. One major difficulty to train such language-based image segmentation systems is the lack of datasets with join...

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is semantic and spatial grounding, which is to infer objects and their spatial relationships from images and natural language expressions. We introduce a two-s...

Using Syntax to Ground Referring Expressions in Natural Images

We introduce GroundNet, a neural network for referring expression recognition – the task of localizing (or grounding) in an image the object referred to by a natural language expression. Our approach to this task is the first to rely on a syntactic analysis of the input referring expression in order to inform the structure of the computation graph. Given a parse tree for an input expression, we...

On Referring Expressions in Query Answering over First Order Knowledge Bases

A referring expression in linguistics is any noun phrase identifying an object in a way that will be useful to interlocutors. In the context of a query over a first order knowledge base K, constant symbols occurring in K are the artifacts usually used as referring expressions in certain answers to the query. In this paper, we begin to explore how this can be usefully extended by allowing a class ...

Development of Behavioral Based System from Sports Video

A system for detecting and analyzing the behavior of a sports person from their facial expression, extracted from a sports video, forms the basis of this project. Shot Segmentation, Object Frame Selection, Image Segmentation, Facial Feature Extraction and Facial Expression Recognition (FER) are the major steps included in developing Cognitive Analysis of Facial Expression Recognition system from Sport...

Publication date: 2018
